Abstract

This paper addresses how well we can recover a data matrix when given only a few of its elements. We
present a randomized algorithm that element-wise sparsifies the data, retaining only a few of its elements.
Our new algorithm independently samples the data using sampling probabilities that depend on both
the squares ($\ell_2$ sampling) and absolute values ($\ell_1$ sampling) of the entries. We prove that the hybrid
algorithm recovers a near-PCA reconstruction of the data from a sublinear sample size; hybrid-$(\ell_1, \ell_2)$
inherits the $\ell_2$-ability to sample the important elements as well as the regularization properties of
$\ell_1$ sampling, and gives strictly better performance than either $\ell_1$ or $\ell_2$ on their own. We also give a
one-pass version of our algorithm and show experiments to corroborate the theory.


1 Introduction 


We address the problem of recovering a near-PCA reconstruction of the data from just a few of its entries,
known as element-wise matrix sparsification (Achlioptas and McSherry (2001, 2007)). Read: you have a small
sample of data points and those data points have missing features. This is a situation that one is confronted
with all too often in machine learning. For example, with user-recommendation data, one does not have
all the ratings of any given user. Or in a privacy-preserving setting, a client may not want to give you all
entries in the data matrix. In such a setting, our goal is to show that if the samples that you do get are
chosen carefully, the top-$k$ PCA features of the data can be recovered within some provable error bounds.


More formally, the data matrix is $\mathbf{A} \in \mathbb{R}^{m \times n}$ ($m$ data points in $n$ dimensions). Often, real data
matrices have low effective rank, so let $\mathbf{A}_k$ be the best rank-$k$ approximation to $\mathbf{A}$ with
$\|\mathbf{A} - \mathbf{A}_k\|_2$ being small. $\mathbf{A}_k$ is obtained by projecting $\mathbf{A}$ onto the subspace spanned by its
top-$k$ principal components. In order to approximate this top-$k$ principal subspace, we adopt the following
strategy. Select a small number, $s$, of elements from $\mathbf{A}$ and produce a sparse sketch $\tilde{\mathbf{A}}$; use the
sparse sketch $\tilde{\mathbf{A}}$ to approximate the top-$k$ singular subspace. In Section 4 we give the details of the
algorithm and the theoretical guarantees on how well we recover the top-$k$ principal subspace. The key
quantity that one must control to recover a close approximation to PCA is how well the sparse sketch
approximates the data in the operator norm. That is, if $\|\mathbf{A} - \tilde{\mathbf{A}}\|_2$ is small then you can recover PCA
effectively.


Problem: sparse sampling of data elements

Given $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\epsilon > 0$, sample a small number of elements $s$ to obtain a sparse sketch
$\tilde{\mathbf{A}}$ for which

$$\|\mathbf{A} - \tilde{\mathbf{A}}\|_2 \le \epsilon \quad \text{and} \quad \|\tilde{\mathbf{A}}\|_0 \le s. \tag{1}$$

Our main result addresses the problem above. In a nutshell, with only partially observed data that have
been carefully selected, one can recover an approximation to the top-$k$ principal subspace. An additional
benefit is that computing our approximation to the top-$k$ subspace using iterated multiplication can benefit
computationally from sparsity. To construct $\tilde{\mathbf{A}}$, we use a general randomized approach which
independently samples (and rescales) $s$ elements from $\mathbf{A}$ using probability $p_{ij}$ to sample element
$\mathbf{A}_{ij}$. We analyze in detail the case $p_{ij} \propto \alpha |\mathbf{A}_{ij}| + (1 - \alpha) |\mathbf{A}_{ij}|^2$ to get a bound on
$\|\mathbf{A} - \tilde{\mathbf{A}}\|_2$. We now make our discussion precise, starting with our notation.


1.1 Notation 


We use bold uppercase (e.g., $\mathbf{X}$) for matrices and bold lowercase (e.g., $\mathbf{x}$) for column vectors. The
$i$-th row of $\mathbf{X}$ is $\mathbf{X}_{(i)}$, and the $i$-th column of $\mathbf{X}$ is $\mathbf{X}^{(i)}$. Let $[n]$ denote the set
$\{1, 2, \ldots, n\}$. $E(X)$ is the expectation of a random variable $X$; for a matrix, $E(\mathbf{X})$ denotes the
element-wise expectation. For a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$, the Frobenius norm $\|\mathbf{X}\|_F$ is
$\|\mathbf{X}\|_F^2 = \sum_{i,j=1}^{m,n} \mathbf{X}_{ij}^2$, and the spectral (operator) norm $\|\mathbf{X}\|_2$ is
$\|\mathbf{X}\|_2 = \max_{\|\mathbf{y}\|_2 = 1} \|\mathbf{X}\mathbf{y}\|_2$. We also have the $\ell_1$ and $\ell_0$ norms:
$\|\mathbf{X}\|_1 = \sum_{i,j=1}^{m,n} |\mathbf{X}_{ij}|$ and $\|\mathbf{X}\|_0$ (the number of non-zero entries in $\mathbf{X}$). The
$k$-th largest singular value of $\mathbf{X}$ is $\sigma_k(\mathbf{X})$. For symmetric matrices $\mathbf{X}$, $\mathbf{Y}$,
$\mathbf{Y} \succeq \mathbf{X}$ if and only if $\mathbf{Y} - \mathbf{X}$ is positive semi-definite. $\mathbf{I}_n$ is the $n \times n$ identity
and $\ln x$ is the natural logarithm of $x$. We use $\mathbf{e}_i$ to denote standard basis vectors whose dimensions
will be clear from the context.


Two natural choices of sampling probabilities are $\ell_1$ ($p_{ij} = |\mathbf{A}_{ij}| / \|\mathbf{A}\|_1$; Achlioptas et al.
(2013)) and $\ell_2$ ($p_{ij} = \mathbf{A}_{ij}^2 / \|\mathbf{A}\|_F^2$; Achlioptas and McSherry (2001, 2007); Drineas and Zouzias
(2011)). We construct $\tilde{\mathbf{A}}$ as follows: $\tilde{\mathbf{A}}_{ij} = 0$ if the $(i, j)$-th entry is not sampled; sampled
elements $\mathbf{A}_{ij}$ are rescaled to $\tilde{\mathbf{A}}_{ij} = \mathbf{A}_{ij} / p_{ij}$, which makes the sketch $\tilde{\mathbf{A}}$ an unbiased
estimator of $\mathbf{A}$, so $E[\tilde{\mathbf{A}}] = \mathbf{A}$. The sketch is sparse if the number of sampled elements is
sublinear, $s = o(mn)$. Sampling according to element magnitudes is natural in many applications; for
example, in a recommendation system users tend to rate a product they either like (high positive) or
dislike (high negative).

Our main sparsification algorithm (Algorithm 1) receives as input a matrix $\mathbf{A}$ and an accuracy
parameter $\epsilon > 0$, and samples $s$ elements from $\mathbf{A}$ in $s$ independent, identically distributed trials
with replacement, according to a hybrid-$(\ell_1, \ell_2)$ probability distribution specified in equation (2). The
algorithm returns $\tilde{\mathbf{A}} \in \mathbb{R}^{m \times n}$, a sparse and unbiased estimator of $\mathbf{A}$, as a solution to (1).
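The sampling scheme just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the mixture $p_{ij} = \alpha |\mathbf{A}_{ij}| / \|\mathbf{A}\|_1 + (1 - \alpha) \mathbf{A}_{ij}^2 / \|\mathbf{A}\|_F^2$, stores the matrix as a list of lists, and averages the $s$ rescaled draws (each draw contributes $\mathbf{A}_{ij} / (s\, p_{ij})$). The names `hybrid_probabilities` and `sparsify` are hypothetical.

```python
import random

def hybrid_probabilities(A, alpha):
    """p_ij = alpha*|A_ij|/||A||_1 + (1-alpha)*A_ij^2/||A||_F^2 for a list-of-lists matrix."""
    l1 = sum(abs(v) for row in A for v in row)
    fro_sq = sum(v * v for row in A for v in row)
    return [[alpha * abs(v) / l1 + (1 - alpha) * v * v / fro_sq for v in row]
            for row in A]

def sparsify(A, s, alpha, rng):
    """Draw s i.i.d. index pairs with replacement; return the sketch as {(i, j): value}."""
    P = hybrid_probabilities(A, alpha)
    # Zero entries have probability zero, so only non-zero cells are candidates.
    cells = [(i, j) for i, row in enumerate(A) for j, v in enumerate(row) if v != 0]
    weights = [P[i][j] for i, j in cells]
    sketch = {}
    for i, j in rng.choices(cells, weights=weights, k=s):
        # Each draw contributes A_ij / (s * p_ij); repeated draws accumulate.
        sketch[(i, j)] = sketch.get((i, j), 0.0) + A[i][j] / (s * P[i][j])
    return sketch

# Hypothetical toy matrix, used only for illustration.
A = [[1.0, 0.0, -2.0],
     [0.5, 3.0, 0.0]]
P = hybrid_probabilities(A, alpha=0.7)
sketch = sparsify(A, s=4, alpha=0.7, rng=random.Random(0))
```

The sketch holds at most $s$ distinct entries, and summing $p_{ij} \cdot \mathbf{A}_{ij} / p_{ij}$ over the non-zero cells recovers $\mathbf{A}$ entry by entry, which is the unbiasedness property stated above.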


1.2 Prior work 


Achlioptas and McSherry (2001, 2007) pioneered the idea of $\ell_2$ sampling for element-wise sparsification.
However, $\ell_2$ sampling on its own is not enough for provably accurate bounds for $\|\mathbf{A} - \tilde{\mathbf{A}}\|_2$. As a
matter of fact, Achlioptas and McSherry (2001, 2007) observed that "small" entries need to be sampled with
probabilities that depend on their absolute values only, thus also introducing the notion of $\ell_1$ sampling.
The underlying reason for the need of $\ell_1$ sampling is the fact that if a small element is sampled and
rescaled using $\ell_2$ sampling, this would result in a huge entry in $\tilde{\mathbf{A}}$ (because of the rescaling). As a
result, the variance of $\ell_2$ sampling is quite high, resulting in poor theoretical and experimental behavior.
$\ell_1$ sampling of small entries rectifies this issue by reducing the variance of the overall approach.
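The rescaling effect can be made concrete with a small numeric sketch. The four entries below are hypothetical, and `rescaled_magnitude` is an illustrative helper that evaluates $|a| / p$ for the mixture $p = \alpha |a| / \|\mathbf{A}\|_1 + (1 - \alpha) a^2 / \|\mathbf{A}\|_F^2$; setting $\alpha = 0$ is used here only to mimic pure $\ell_2$ sampling.

```python
def rescaled_magnitude(a, l1_norm, fro_sq, alpha):
    """|a| / p for the entry a under p = alpha*|a|/l1_norm + (1-alpha)*a^2/fro_sq."""
    p = alpha * abs(a) / l1_norm + (1 - alpha) * a * a / fro_sq
    return abs(a) / p

# Hypothetical entries: three "large" values and one "small" one.
entries = [1.0, 1.0, 1.0, 0.01]
l1_norm = sum(abs(a) for a in entries)
fro_sq = sum(a * a for a in entries)

small_l1 = rescaled_magnitude(0.01, l1_norm, fro_sq, alpha=1.0)      # pure l1
small_l2 = rescaled_magnitude(0.01, l1_norm, fro_sq, alpha=0.0)      # pure l2
small_hybrid = rescaled_magnitude(0.01, l1_norm, fro_sq, alpha=0.5)  # mixture
```

Under pure $\ell_1$ sampling every rescaled entry has magnitude $\|\mathbf{A}\|_1$, whereas under pure $\ell_2$ sampling the small entry is blown up to $\|\mathbf{A}\|_F^2 / |a|$, roughly one hundred times larger in this example. The mixture sits in between.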


Arora et al. (2006) proposed a sparsification algorithm that deterministically keeps large entries, i.e.,
entries of $\mathbf{A}$ such that $|\mathbf{A}_{ij}| \ge \epsilon / \sqrt{n}$, and randomly rounds the remaining entries using $\ell_1$
sampling. Formally, entries of $\mathbf{A}$ that are smaller than $\epsilon / \sqrt{n}$ are set to
$\mathrm{sign}(\mathbf{A}_{ij})\, \epsilon / \sqrt{n}$ with probability $p_{ij} = \sqrt{n}\, |\mathbf{A}_{ij}| / \epsilon$ and to zero otherwise. They used
an $\epsilon$-net argument to show that $\|\mathbf{A} - \tilde{\mathbf{A}}\|_2$ was bounded with high probability. Drineas and
Zouzias (2011) bypassed the need for $\ell_1$ sampling by zeroing out the small entries of $\mathbf{A}$ (e.g., all
entries such that $|\mathbf{A}_{ij}| < \epsilon / 2n$ for a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$) and then using $\ell_2$ sampling on the
remaining entries in order to sparsify the matrix. This simple modification improves on Achlioptas and
McSherry (2007) and Arora et al. (2006), and comes with an elegant proof using the matrix-Bernstein
inequality of Recht (2011). Note that all these approaches need truncation of small entries. Recently,
Achlioptas et al. (2013) showed that $\ell_1$ sampling in isolation could be done without any truncation, and
argued that (under certain assumptions) $\ell_1$ sampling would be better than $\ell_2$ sampling, even using the
truncation. Their proof is also based on the matrix-valued Bernstein inequality of Recht (2011).


1.3 Our Contributions 


We introduce an intuitive hybrid approach to element-wise matrix sparsification, by combining $\ell_1$ and
$\ell_2$ sampling.¹ We propose to use sampling probabilities of the form

$$p_{ij} = \alpha \cdot \frac{|\mathbf{A}_{ij}|}{\|\mathbf{A}\|_1} + (1 - \alpha) \cdot \frac{\mathbf{A}_{ij}^2}{\|\mathbf{A}\|_F^2}, \qquad \alpha \in (0, 1] \tag{2}$$

for all $i, j$. We essentially retain the good properties of $\ell_2$ sampling that bias us towards data elements
in the presence of small noise, while regularizing smaller entries using $\ell_1$ sampling. The proof of the
quality-of-approximation result of Algorithm 1 (i.e., Theorem 1) uses the matrix-Bernstein Lemma 1. We
summarize the main contributions below:

• We give a parameterized sampling distribution in the variable $\alpha \in (0, 1]$ that controls the balance
between $\ell_2$ sampling and $\ell_1$ regularization. This greater flexibility allows us to achieve greater accuracy.

• We derive the optimal hybrid-$(\ell_1, \ell_2)$ distribution, using Lemma 1 for arbitrary $\mathbf{A}$, by computing
the optimal parameter $\alpha^*$ which produces the desired accuracy with smallest sample size according to our
theoretical bound.

Our result generalizes the existing results because setting $\alpha = 1$ in our bounds reproduces the result of
Achlioptas et al. (2013), who claim that $\ell_1$ sampling is almost always better than $\ell_2$ sampling. Our results
show that $\alpha^* < 1$, which means that the hybrid approach is best.

• We give a provable algorithm (Algorithm 2) to implement hybrid-$(\ell_1, \ell_2)$ sampling without knowing
$\alpha$ a priori, i.e., we need not 'fix' the distribution using some predetermined value of $\alpha$ at the beginning
of the sampling process. We can set $\alpha$ at a later stage, yet we can realize hybrid-$(\ell_1, \ell_2)$ sampling. We
use Algorithm 2 to propose a pass-efficient element-wise sampling model using only one pass over the
elements of the data $\mathbf{A}$, using $O(s)$ memory. Moreover, Algorithm 3 gives us a heuristic to estimate $\alpha^*$ in
one pass over the data using $O(s)$ memory.

• Finally, we propose Algorithm 4, which provably recovers PCA by constructing a sparse unbiased
estimator of (centered) data using our optimal hybrid-$(\ell_1, \ell_2)$ sampling.

Experimental results suggest that our optimal hybrid distribution (using $\alpha^*$) requires strictly smaller
sample size than $\ell_1$ and $\ell_2$ sampling (with or without truncation) to solve (1). Also, we achieve significant
speed-up of PCA on sparsified synthetic and real data while maintaining high-quality approximation.


1.3.1 A Motivating Example for Hybrid-$(\ell_1, \ell_2)$ Sampling

The main motivation for introducing the idea of hybrid-$(\ell_1, \ell_2)$ sampling on elements of $\mathbf{A}$ comes from
achieving a tighter bound on $s$ using a simple and intuitive probability distribution on elements of $\mathbf{A}$.
For this, we observe certain good properties of both $\ell_1$ and $\ell_2$ sampling for sparsification of noisy data (in
practice, we experience data that are noisy, and it is perhaps impossible to separate "true" data from noise).
We illustrate the behavior of $\ell_1$ and $\ell_2$ sampling on noisy data using the following synthetic example. We
construct a $500 \times 500$ binary data $\mathbf{D}$ (Figure 1), and then perturb it by a random Gaussian matrix $\mathbf{N}$
whose elements follow a Gaussian distribution with mean zero and standard deviation 0.1. We denote this
perturbed data matrix by $\mathbf{A}_{0.1}$. First, we note that $\ell_1$ and $\ell_2$ sampling work identically on binary data
$\mathbf{D}$. However, Figure 2 depicts the change in behavior of $\ell_1$ and $\ell_2$ sampling when sparsifying $\mathbf{A}_{0.1}$.
Data elements and noise in $\mathbf{A}_{0.1}$ are the elements with non-zero and zero values in $\mathbf{D}$, respectively. We
sample $s = 5000$ indices in i.i.d. trials according to $\ell_1$ and $\ell_2$ probabilities separately to produce sparse
sketch $\tilde{\mathbf{A}}$. Figure 2 shows that elements of $\tilde{\mathbf{A}}$, produced by $\ell_1$ sampling, have controlled variance
but most of them are noise. On the other hand, $\ell_2$ sampling is biased towards data elements, although a
small number of sampled noisy elements create large variance due to rescaling. Our hybrid-$(\ell_1, \ell_2)$
sampling benefits from this bias of $\ell_2$ towards data elements, as well as the regularization properties of
$\ell_1$.

¹ Combining $\ell_1$ and $\ell_2$ probabilities to avoid the zeroing-out step of $\ell_2$ sampling has recently been
observed by Kundu and Drineas (2014).
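The bias described above can be checked on a scaled-down stand-in for the synthetic example. The sketch below is an illustration under stated assumptions: the binary pattern (a block of ones) and the $50 \times 50$ size are hypothetical choices, not the paper's Figure 1, and `noise_mass` is an illustrative helper that adds up the sampling probability falling on entries that are zero in the clean matrix.

```python
import random

def noise_mass(D, A, alpha):
    """Total hybrid sampling probability of entries of A where the clean matrix D is zero."""
    l1 = sum(abs(v) for row in A for v in row)
    fro_sq = sum(v * v for row in A for v in row)
    mass = 0.0
    for d_row, a_row in zip(D, A):
        for d, a in zip(d_row, a_row):
            if d == 0:
                mass += alpha * abs(a) / l1 + (1 - alpha) * a * a / fro_sq
    return mass

rng = random.Random(0)
n = 50  # small stand-in for the 500 x 500 example
# Hypothetical binary pattern: a block of ones in the top-left corner.
D = [[1.0 if i < 10 and j < 10 else 0.0 for j in range(n)] for i in range(n)]
# Perturb every entry with Gaussian noise of standard deviation 0.1.
A = [[d + rng.gauss(0.0, 0.1) for d in row] for row in D]

mass_l1 = noise_mass(D, A, alpha=1.0)      # pure l1
mass_l2 = noise_mass(D, A, alpha=0.0)      # pure l2 (alpha = 0 used only for comparison)
mass_hybrid = noise_mass(D, A, alpha=0.7)  # mixture
```

On data of this kind, $\ell_1$ sampling places most of its probability on noise, $\ell_2$ sampling places much less there, and the hybrid mass is the corresponding convex combination of the two.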



[Figure 2 panels: (a) $\ell_1$, (b) $\ell_2$, (c) Hybrid-$(\ell_1, \ell_2)$; x-axis: sampled elements (0 to 5000).]

Figure 2: Elements of sparse sketch $\tilde{\mathbf{A}}$ produced from $\mathbf{A}_{0.1}$ via (a) $\ell_1$ sampling, (b) $\ell_2$ sampling, and
(c) hybrid-$(\ell_1, \ell_2)$ sampling with $\alpha = 0.7$. The y-axis plots the rescaled absolute values (in ln scale)
of $\tilde{\mathbf{A}}$ corresponding to the sampled indices. $\ell_1$ sampling produces elements with controlled variance but
it mostly samples noise, whereas $\ell_2$ samples a lot of data although producing large variance of rescaled
elements. Hybrid-$(\ell_1, \ell_2)$ sampling uses $\ell_1$ as a regularizer while sampling a fairly large number of data
elements, which helps to preserve the structure of the original data.


We parameterize our distribution using the variable $\alpha \in (0, 1]$ that controls the balance between $\ell_2$
sampling and $\ell_1$ regularization. We derive an expression to compute $\alpha^*$, the optimal $\alpha$, corresponding to
the smallest sample size that we need in order to achieve a given accuracy $\epsilon$ in (1). Setting $\alpha = 1$, we
reproduce the result of Achlioptas et al. (2013). However, $\alpha^*$ may be smaller than 1, and the bound on
sample size $s$, using $\alpha^*$, is guaranteed to be tighter than that of Achlioptas et al. (2013).


2 Main Result 

We present the quality-of-approximation result of our main algorithm (Algorithm 1). We define the
sampling operator $S_\Omega : \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ in (3) that extracts elements from a given matrix
$\mathbf{A} \in \mathbb{R}^{m \times n}$. Let $\Omega$ be a multi-set of sampled indices $(i_t, j_t)$, for $t = 1, \ldots, s$. Then,

$$S_\Omega(\mathbf{A}) = \frac{1}{s} \sum_{t=1}^{s} \frac{\mathbf{A}_{i_t j_t}}{p_{i_t j_t}} \mathbf{e}_{i_t} \mathbf{e}_{j_t}^T, \qquad (i_t, j_t) \in \Omega. \tag{3}$$

Algorithm 1 randomly samples (in i.i.d. trials) $s$ elements of a given matrix $\mathbf{A}$, according to a probability
distribution $\{p_{ij}\}_{i,j=1}^{m,n}$ over the elements of $\mathbf{A}$. Let the $p_{ij}$'s be as in eqn. (2). Then, we can prove the
following theorem.

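Theorem 1 Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ and let $\epsilon > 0$ be an accuracy parameter. Let $S_\Omega$ be the sampling operator
defined in (3), and assume that the multi-set $\Omega$ is generated using sampling probabilities $\{p_{ij}\}_{i,j=1}^{m,n}$ as in
(2). Then, with probability at least $1 - \delta$,

$$\|S_\Omega(\mathbf{A}) - \mathbf{A}\|_2 \le \epsilon \|\mathbf{A}\|_2, \tag{4}$$

if

$$s \ge \frac{2}{\epsilon^2 \|\mathbf{A}\|_2^2} \left( \rho^2(\alpha) + \gamma(\alpha)\, \epsilon\, \|\mathbf{A}\|_2 / 3 \right) \ln\left( \frac{m + n}{\delta} \right), \tag{5}$$

where

$$\xi_{ij} = \|\mathbf{A}\|_F^2 \Big/ \left( \frac{\alpha \cdot \|\mathbf{A}\|_F^2}{|\mathbf{A}_{ij}| \cdot \|\mathbf{A}\|_1} + (1 - \alpha) \right), \quad \text{for } \mathbf{A}_{ij} \ne 0,$$

$$\rho^2(\alpha) = \max\left\{ \max_i \sum_{j=1}^{n} \xi_{ij},\; \max_j \sum_{i=1}^{m} \xi_{ij} \right\} - \sigma_{\min}^2(\mathbf{A}),$$

$$\gamma(\alpha) = \max_{i,j:\, \mathbf{A}_{ij} \ne 0} \left\{ \frac{\|\mathbf{A}\|_1}{\alpha + (1 - \alpha) \frac{\|\mathbf{A}\|_1 \cdot |\mathbf{A}_{ij}|}{\|\mathbf{A}\|_F^2}} \right\} + \|\mathbf{A}\|_2,$$

and $\sigma_{\min}(\mathbf{A})$ is the smallest singular value of $\mathbf{A}$. Moreover, we can find $\alpha^*$ (the optimal $\alpha$
corresponding to the smallest $s$) and $s^*$ (the smallest $s$) by solving the optimization problem in (6) and
substituting the result in (7):

$$\alpha^* = \arg\min_{\alpha \in (0, 1]} f(\alpha), \qquad f(\alpha) = \rho^2(\alpha) + \gamma(\alpha)\, \epsilon\, \|\mathbf{A}\|_2 / 3, \tag{6}$$

$$s^* = \frac{2}{\epsilon^2 \|\mathbf{A}\|_2^2} \left( \rho^2(\alpha^*) + \gamma(\alpha^*)\, \epsilon\, \|\mathbf{A}\|_2 / 3 \right) \ln\left( \frac{m + n}{\delta} \right). \tag{7}$$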

The functional form in (5) comes from the matrix-Bernstein inequality in Lemma 1, with $\rho^2$ and $\gamma$ being
functions of $\mathbf{A}$ and $\alpha$. This gives us the flexibility to optimize the sample size with respect to $\alpha$ in (5),
which is how we get the optimal $\alpha^*$. For a given matrix $\mathbf{A}$, we can easily compute $\rho^2(\alpha)$ and $\gamma(\alpha)$ for
various values of $\alpha$. Given an accuracy $\epsilon$ and failure probability $\delta$, we can compute $\alpha^*$ corresponding to
the tightest bound on $s$. Note that, for $\alpha = 1$ we reproduce the results of Achlioptas et al. (2013) (which
were expressed using various matrix metrics). However, $\alpha^*$ may be smaller than 1, and is guaranteed to
produce a tighter $s$ compared to extreme choices of $\alpha$ (e.g., $\alpha = 1$ for $\ell_1$ sampling). We illustrate this by
the plot in Figure 3. We give a proof of Theorem 1 in Section 2.1.
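As a rough sketch of how the bound might be evaluated in practice, the code below computes $\rho^2(\alpha)$, $\gamma(\alpha)$, and the right-hand side of (5), then scans a grid of $\alpha$ values. It is an illustration only: the spectral norm and smallest singular value are passed in by the caller rather than computed, the grid search is a stand-in for whatever optimizer one prefers, and the function names are hypothetical.

```python
import math

def bound_terms(A, alpha, spec_norm, sigma_min):
    """Return (rho_sq, gamma) of Theorem 1; the caller supplies ||A||_2 and sigma_min(A)."""
    l1 = sum(abs(v) for row in A for v in row)
    fro_sq = sum(v * v for row in A for v in row)
    m, n = len(A), len(A[0])
    # xi_ij is defined only for non-zero entries; zeros contribute nothing.
    xi = [[fro_sq / (alpha * fro_sq / (abs(v) * l1) + (1 - alpha)) if v != 0 else 0.0
           for v in row] for row in A]
    max_row = max(sum(row) for row in xi)
    max_col = max(sum(xi[i][j] for i in range(m)) for j in range(n))
    rho_sq = max(max_row, max_col) - sigma_min ** 2
    gamma = max(l1 / (alpha + (1 - alpha) * l1 * abs(v) / fro_sq)
                for row in A for v in row if v != 0) + spec_norm
    return rho_sq, gamma

def sample_size(A, alpha, eps, delta, spec_norm, sigma_min):
    """Right-hand side of (5) for a given alpha."""
    rho_sq, gamma = bound_terms(A, alpha, spec_norm, sigma_min)
    m, n = len(A), len(A[0])
    f = rho_sq + gamma * eps * spec_norm / 3
    return 2 / (eps * spec_norm) ** 2 * f * math.log((m + n) / delta)

# Diagonal toy matrix, so ||A||_2 = 3 and sigma_min = 1 are known exactly.
A = [[3.0, 0.0], [0.0, 1.0]]
grid = [k / 100 for k in range(1, 101)]
alpha_star = min(grid, key=lambda a: sample_size(A, a, 0.05, 0.1, 3.0, 1.0))
rho_sq_l1, gamma_l1 = bound_terms(A, 1.0, 3.0, 1.0)
```

Because the grid contains $\alpha = 1$, the minimizer found this way can never give a larger bound than pure $\ell_1$ sampling. A finer grid or a one-dimensional optimizer would refine the estimate.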


2.1 Proof of Theorem 1


In this section we provide a proof of Theorem 1 following the proof outline of Drineas and Zouzias (2011)
and Achlioptas et al. (2013). We use the following non-commutative matrix-valued Bernstein bound of Recht
(2011) as our main tool to prove Theorem 1. Using our notation we rephrase the matrix Bernstein bound.

Algorithm 1 Element-wise Matrix Sparsification

1: Input: $\mathbf{A} \in \mathbb{R}^{m \times n}$, accuracy parameter $\epsilon > 0$.

2: Set $s$ as in eqn. (7).

3: For $t = 1 \ldots s$ (i.i.d. trials with replacement) randomly sample pairs of indices $(i_t, j_t) \in [m] \times [n]$
   with $P[(i_t, j_t) = (i, j)] = p_{ij}$, where $p_{ij}$ are as in (2), using $\alpha$ as in (6).

4: Output (sparse): $S_\Omega(\mathbf{A}) = \frac{1}{s} \sum_{t=1}^{s} \frac{\mathbf{A}_{i_t j_t}}{p_{i_t j_t}} \mathbf{e}_{i_t} \mathbf{e}_{j_t}^T$.


Figure 3: Plot of $f(\alpha)$ in eqn. (6) for data $\mathbf{A}_{0.1}$. We use $\epsilon = 0.05$ and $\delta = 0.1$. The x-axis plots $\alpha$ and
the y-axis is in $\log_{10}$ scale. For this data, $\alpha^* \approx 0.6$.


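Lemma 1 [Theorem 3.2 of Recht (2011)] Let $\mathbf{M}_1, \mathbf{M}_2, \ldots, \mathbf{M}_s$ be independent, zero-mean random
matrices in $\mathbb{R}^{m \times n}$. Suppose

$$\max_{t \in [s]} \left\{ \|E(\mathbf{M}_t \mathbf{M}_t^T)\|_2,\; \|E(\mathbf{M}_t^T \mathbf{M}_t)\|_2 \right\} \le \rho^2$$

and $\|\mathbf{M}_t\|_2 \le \gamma$ for all $t \in [s]$. Then, for any $\epsilon > 0$,

$$\left\| \frac{1}{s} \sum_{t=1}^{s} \mathbf{M}_t \right\|_2 \le \epsilon$$

holds, subject to a failure probability at most

$$(m + n) \exp\left( \frac{-s \epsilon^2 / 2}{\rho^2 + \gamma \epsilon / 3} \right).$$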

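For all $t \in [s]$, we define the matrix $\mathbf{M}_t \in \mathbb{R}^{m \times n}$ as

$$\mathbf{M}_t = \frac{\mathbf{A}_{i_t j_t}}{p_{i_t j_t}} \mathbf{e}_{i_t} \mathbf{e}_{j_t}^T - \mathbf{A}.$$

It now follows that

$$\frac{1}{s} \sum_{t=1}^{s} \mathbf{M}_t = \frac{1}{s} \sum_{t=1}^{s} \left[ \frac{\mathbf{A}_{i_t j_t}}{p_{i_t j_t}} \mathbf{e}_{i_t} \mathbf{e}_{j_t}^T - \mathbf{A} \right] = S_\Omega(\mathbf{A}) - \mathbf{A}.$$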


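We can bound $\|\mathbf{M}_t\|_2$ for all $t \in [s]$. We define the following quantity:

$$\lambda_{ij} = \frac{\|\mathbf{A}\|_1 \cdot |\mathbf{A}_{ij}|}{\|\mathbf{A}\|_F^2}, \quad \text{for } \mathbf{A}_{ij} \ne 0. \tag{8}$$

Lemma 2 Using our notation, and using probabilities of the form (2), for all $t \in [s]$,

$$\|\mathbf{M}_t\|_2 \le \max_{i,j:\, \mathbf{A}_{ij} \ne 0} \frac{\|\mathbf{A}\|_1}{\alpha + (1 - \alpha) \lambda_{ij}} + \|\mathbf{A}\|_2.$$

Proof: Using probabilities of the form (2), and because $\mathbf{A}_{ij} = 0$ is never sampled,

$$\|\mathbf{M}_t\|_2 = \left\| \frac{\mathbf{A}_{i_t j_t}}{p_{i_t j_t}} \mathbf{e}_{i_t} \mathbf{e}_{j_t}^T - \mathbf{A} \right\|_2 \le \max_{i,j:\, \mathbf{A}_{ij} \ne 0} \left\{ \left( \frac{\alpha}{\|\mathbf{A}\|_1} + \frac{(1 - \alpha) \cdot |\mathbf{A}_{ij}|}{\|\mathbf{A}\|_F^2} \right)^{-1} \right\} + \|\mathbf{A}\|_2.$$

Using (8), we obtain the bound.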
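The identity behind Lemma 2, $|\mathbf{A}_{ij}| / p_{ij} = \|\mathbf{A}\|_1 / (\alpha + (1 - \alpha) \lambda_{ij})$, is easy to confirm numerically. The following is a small sanity check rather than part of the proof; the toy matrix and the name `rescaled_entry_bound` are hypothetical.

```python
def rescaled_entry_bound(A, alpha):
    """Compare |A_ij|/p_ij with ||A||_1 / (alpha + (1-alpha)*lambda_ij) on every non-zero entry."""
    l1 = sum(abs(v) for row in A for v in row)
    fro_sq = sum(v * v for row in A for v in row)
    worst_gap, largest = 0.0, 0.0
    for row in A:
        for v in row:
            if v == 0:
                continue  # zero entries have p_ij = 0 and are never sampled
            p = alpha * abs(v) / l1 + (1 - alpha) * v * v / fro_sq
            lam = l1 * abs(v) / fro_sq                      # lambda_ij of (8)
            closed_form = l1 / (alpha + (1 - alpha) * lam)
            worst_gap = max(worst_gap, abs(abs(v) / p - closed_form))
            largest = max(largest, closed_form)
    return worst_gap, largest

A = [[1.0, 0.0, -2.0], [0.5, 3.0, 0.0]]
gap, largest = rescaled_entry_bound(A, alpha=0.7)
```

The returned `largest` value is the first term of the Lemma 2 bound. It never exceeds $\|\mathbf{A}\|_1 / \alpha$, which is one way to see how the $\ell_1$ component caps the size of rescaled entries.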

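Next we bound the spectral norm of the expectation of $\mathbf{M}_t \mathbf{M}_t^T$.

Lemma 3 Using our notation, and using probabilities of the form (2), for all $t \in [s]$,

$$\|E(\mathbf{M}_t \mathbf{M}_t^T)\|_2 \le \|\mathbf{A}\|_F^2\, \beta_1 - \sigma_{\min}^2(\mathbf{A}),$$

where

$$\beta_1 = \max_i \sum_{j=1}^{n} \left( \frac{\alpha \cdot \|\mathbf{A}\|_F^2}{|\mathbf{A}_{ij}| \cdot \|\mathbf{A}\|_1} + (1 - \alpha) \right)^{-1}, \quad \text{for } \mathbf{A}_{ij} \ne 0.$$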


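Proof: Recall that $\mathbf{A} = \sum_{i,j=1}^{m,n} \mathbf{A}_{ij} \mathbf{e}_i \mathbf{e}_j^T$ and $\mathbf{M}_t = \frac{\mathbf{A}_{i_t j_t}}{p_{i_t j_t}} \mathbf{e}_{i_t} \mathbf{e}_{j_t}^T - \mathbf{A}$ to derive

$$E[\mathbf{M}_t \mathbf{M}_t^T] = \sum_{i,j=1}^{m,n} p_{ij} \left( \frac{\mathbf{A}_{ij}}{p_{ij}} \mathbf{e}_i \mathbf{e}_j^T - \mathbf{A} \right) \left( \frac{\mathbf{A}_{ij}}{p_{ij}} \mathbf{e}_j \mathbf{e}_i^T - \mathbf{A}^T \right) = \sum_{i,j=1}^{m,n} \frac{\mathbf{A}_{ij}^2}{p_{ij}} \mathbf{e}_i \mathbf{e}_i^T - \mathbf{A} \mathbf{A}^T.$$

Sampling according to probabilities of eqn. (2), and because $\mathbf{A}_{ij} = 0$ is never sampled, we get, for
$\mathbf{A}_{ij} \ne 0$,

$$\sum_{i,j=1}^{m,n} \frac{\mathbf{A}_{ij}^2}{p_{ij}} \mathbf{e}_i \mathbf{e}_i^T = \|\mathbf{A}\|_F^2 \sum_{i=1}^{m} \left[ \sum_{j=1}^{n} \left( \frac{\alpha \cdot \|\mathbf{A}\|_F^2}{|\mathbf{A}_{ij}| \cdot \|\mathbf{A}\|_1} + (1 - \alpha) \right)^{-1} \right] \mathbf{e}_i \mathbf{e}_i^T \preceq \|\mathbf{A}\|_F^2 \sum_{i=1}^{m} \left[ \max_{i'} \sum_{j=1}^{n} \left( \frac{\alpha \cdot \|\mathbf{A}\|_F^2}{|\mathbf{A}_{i'j}| \cdot \|\mathbf{A}\|_1} + (1 - \alpha) \right)^{-1} \right] \mathbf{e}_i \mathbf{e}_i^T.$$

Thus,

$$E[\mathbf{M}_t \mathbf{M}_t^T] \preceq \|\mathbf{A}\|_F^2\, \beta_1 \sum_{i=1}^{m} \mathbf{e}_i \mathbf{e}_i^T - \mathbf{A} \mathbf{A}^T = \|\mathbf{A}\|_F^2\, \beta_1 \mathbf{I}_m - \mathbf{A} \mathbf{A}^T.$$

Note that $\|\mathbf{A}\|_F^2\, \beta_1 \mathbf{I}_m$ is a diagonal matrix with all entries non-negative, and $\mathbf{A} \mathbf{A}^T$ is a positive
semi-definite matrix. Therefore,

$$\|E[\mathbf{M}_t \mathbf{M}_t^T]\|_2 \le \|\mathbf{A}\|_F^2\, \beta_1 - \sigma_{\min}^2(\mathbf{A}).$$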


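Similarly, we can obtain

$$\|E[\mathbf{M}_t^T \mathbf{M}_t]\|_2 \le \|\mathbf{A}\|_F^2\, \beta_2 - \sigma_{\min}^2(\mathbf{A}),$$

where

$$\beta_2 = \max_j \sum_{i=1}^{m} \left( \frac{\alpha \cdot \|\mathbf{A}\|_F^2}{|\mathbf{A}_{ij}| \cdot \|\mathbf{A}\|_1} + (1 - \alpha) \right)^{-1}, \quad \text{for } \mathbf{A}_{ij} \ne 0.$$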
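The second-moment identity used in the proof of Lemma 3, $E[\mathbf{M}_t \mathbf{M}_t^T] = \sum_{i,j} (\mathbf{A}_{ij}^2 / p_{ij})\, \mathbf{e}_i \mathbf{e}_i^T - \mathbf{A} \mathbf{A}^T$, can be verified by enumerating every possible outcome on a tiny matrix. This is a numerical sanity check under the hybrid probabilities of (2), not a replacement for the argument above; `second_moment_gap` and the toy matrix are hypothetical.

```python
def second_moment_gap(A, alpha):
    """Max-entry gap between E[M M^T] (by enumeration) and diag(sum_j A_ij^2/p_ij) - A A^T."""
    m, n = len(A), len(A[0])
    l1 = sum(abs(v) for row in A for v in row)
    fro_sq = sum(v * v for row in A for v in row)
    P = [[alpha * abs(v) / l1 + (1 - alpha) * v * v / fro_sq for v in row] for row in A]
    AAt = [[sum(A[a][k] * A[b][k] for k in range(n)) for b in range(m)] for a in range(m)]

    expected = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(n):
            if A[i][j] == 0:
                continue  # never sampled
            # M = (A_ij / p_ij) e_i e_j^T - A for this outcome
            M = [[-A[a][b] for b in range(n)] for a in range(m)]
            M[i][j] += A[i][j] / P[i][j]
            for a in range(m):
                for b in range(m):
                    expected[a][b] += P[i][j] * sum(M[a][k] * M[b][k] for k in range(n))

    gap = 0.0
    for a in range(m):
        for b in range(m):
            diag = 0.0
            if a == b:
                diag = sum(A[a][j] ** 2 / P[a][j] for j in range(n) if A[a][j] != 0)
            gap = max(gap, abs(expected[a][b] - (diag - AAt[a][b])))
    return gap

gap = second_moment_gap([[1.0, 0.0, -2.0], [0.5, 3.0, 0.0]], alpha=0.7)
```

The gap is zero up to floating-point rounding, which reflects the cancellation of the two cross terms against one copy of $\mathbf{A} \mathbf{A}^T$.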


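We can now apply Lemma 1 with

$$\rho^2(\alpha) = \|\mathbf{A}\|_F^2 \max\{\beta_1, \beta_2\} - \sigma_{\min}^2(\mathbf{A})$$

and

$$\gamma(\alpha) = \max_{i,j:\, \mathbf{A}_{ij} \ne 0} \frac{\|\mathbf{A}\|_1}{\alpha + (1 - \alpha) \lambda_{ij}} + \|\mathbf{A}\|_2$$

to conclude that $\|S_\Omega(\mathbf{A}) - \mathbf{A}\|_2 \le \epsilon$ holds subject to a failure probability at most

$$(m + n) \exp\left( \frac{-s \epsilon^2 / 2}{\rho^2(\alpha) + \gamma(\alpha) \epsilon / 3} \right).$$

Bounding the failure probability by $\delta$, and setting $\epsilon = \epsilon \cdot \|\mathbf{A}\|_2$, we complete the proof.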
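For completeness, the algebra behind this last step can be spelled out. The derivation below only rearranges the failure-probability bound of Lemma 1 and then rescales the accuracy parameter; it introduces no new assumptions.

```latex
\begin{align*}
(m+n)\exp\!\left(\frac{-s\epsilon^2/2}{\rho^2(\alpha)+\gamma(\alpha)\epsilon/3}\right) \le \delta
&\iff \frac{s\epsilon^2/2}{\rho^2(\alpha)+\gamma(\alpha)\epsilon/3} \ge \ln\!\left(\frac{m+n}{\delta}\right) \\
&\iff s \ge \frac{2}{\epsilon^2}\left(\rho^2(\alpha)+\gamma(\alpha)\epsilon/3\right)\ln\!\left(\frac{m+n}{\delta}\right).
\end{align*}
% Replacing \epsilon by \epsilon \|\mathbf{A}\|_2 turns the last line into
\[
s \ge \frac{2}{\epsilon^2\|\mathbf{A}\|_2^2}
\left(\rho^2(\alpha)+\gamma(\alpha)\,\epsilon\,\|\mathbf{A}\|_2/3\right)
\ln\!\left(\frac{m+n}{\delta}\right),
\]
% which is the sample-size condition (5).
```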


3 One-pass Hybrid-$(\ell_1, \ell_2)$ Sampling


Here we discuss the implementation of hybrid-$(\ell_1, \ell_2)$ sampling in one pass over the input matrix $\mathbf{A}$
using $O(s)$ memory, that is, a streaming model. We know that both $\ell_1$ and $\ell_2$ sampling can be done in one
pass using $O(s)$ memory (see Algorithm SELECT, p. 137 of Drineas et al. (2006)). In our hybrid sampling,
we want the parameter $\alpha$ to depend on data elements, i.e., we do not want to 'fix' it prior to the arrival of
the data stream. Here we give an algorithm (Algorithm 2) to implement a one-pass version of the hybrid
sampling without knowing $\alpha$ a priori.

We note that steps 2-5 of Algorithm 2 access the elements of $\mathbf{A}$ only once, in parallel, to form
independent multisets $S_1$, $S_2$, $S_3$, and $S_4$. Step 6 computes $\|\mathbf{A}\|_F^2$ and $\|\mathbf{A}\|_1$ in parallel in one pass
over $\mathbf{A}$. Subsequent steps do not need to access $\mathbf{A}$ anymore. Interestingly, we set $\alpha$ in step 7 when the
data stream is gone. Steps 10-16 sample $s$ elements from $S_1$ and $S_2$ based on the $\alpha$ in step 7, and produce
sparse matrix $\mathbf{X}$ based on the sampled entries in random multiset $S$. Theorem 2 shows that Algorithm 2
indeed samples elements from $\mathbf{A}$ according to the hybrid-$(\ell_1, \ell_2)$ probabilities in eqn. (2).
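The one-pass collection of $S_1$ and $S_2$ can be sketched as follows. This is an assumption-laden illustration in the spirit of a weighted one-pass selection rule (keep the incoming item with probability equal to its weight divided by the running total of weights), not a transcription of the SELECT algorithm cited above. The function name and the triple format `(i, j, value)` are hypothetical, and the extra multisets $S_3$ and $S_4$ are omitted for brevity.

```python
import random

def one_pass_multisets(stream, s, rng):
    """Single pass over (i, j, value) triples; returns (S1, S2, l1_norm, fro_sq)."""
    S1, S2 = [None] * s, [None] * s
    l1_norm = fro_sq = 0.0
    for i, j, v in stream:
        if v == 0:
            continue  # zero entries carry no weight and are never selected
        l1_norm += abs(v)
        fro_sq += v * v
        for t in range(s):
            # Each of the s independent selectors keeps the new item with
            # probability (its weight) / (running total of weights so far).
            if rng.random() < abs(v) / l1_norm:
                S1[t] = (i, j, v)
            if rng.random() < v * v / fro_sq:
                S2[t] = (i, j, v)
    return S1, S2, l1_norm, fro_sq

# Hypothetical stream of matrix entries in arbitrary order.
stream = [(0, 0, 1.0), (0, 2, -2.0), (1, 0, 0.5), (1, 1, 3.0)]
S1, S2, l1_norm, fro_sq = one_pass_multisets(stream, s=5, rng=random.Random(0))
```

An item that arrives with weight $w_i$ survives to the end with probability $w_i / D_n$, where $D_n$ is the final total, because the later replacement probabilities telescope. Memory stays at $O(s)$ since only the current choices and two running sums are stored.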

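Algorithm 2 One-pass hybrid-$(\ell_1, \ell_2)$ sampling

1: Input: $\mathbf{A}_{ij}$ for all $(i, j) \in [m] \times [n]$, arbitrarily ordered, and sample size $s$.
2: Apply SELECT algorithm in parallel with $O(s)$ memory using $\ell_1$ probabilities to sample $s$ independent
   indices $(i_{t_1}, j_{t_1})$ and corresponding elements $\mathbf{A}_{i_{t_1} j_{t_1}}$ to form random multiset $S_1$ of triples
   $(i_{t_1}, j_{t_1}, \mathbf{A}_{i_{t_1} j_{t_1}})$, for $t_1 = 1, \ldots, s$.
3: Run step 2 in parallel to form another independent multiset $S_3$ of triples $(i_{t_3}, j_{t_3}, \mathbf{A}_{i_{t_3} j_{t_3}})$, for
   $t_3 = 1, \ldots, s$. (This step is only for Algorithm 3.)
4: Apply SELECT algorithm in parallel with $O(s)$ memory using $\ell_2$ probabilities to sample $s$ independent
   indices $(i_{t_2}, j_{t_2})$ and corresponding elements $\mathbf{A}_{i_{t_2} j_{t_2}}$ to form random multiset $S_2$ of triples
   $(i_{t_2}, j_{t_2}, \mathbf{A}_{i_{t_2} j_{t_2}})$, for $t_2 = 1, \ldots, s$.
5: Run step 4 in parallel to form another independent multiset $S_4$ of triples $(i_{t_4}, j_{t_4}, \mathbf{A}_{i_{t_4} j_{t_4}})$, for
   $t_4 = 1, \ldots, s$. (This step is only for Algorithm 3.)
6: Compute and store $\|\mathbf{A}\|_F^2$ and $\|\mathbf{A}\|_1$ in parallel.
7: Set the value of $\alpha \in (0, 1]$ (using Algorithm 3).
8: Create empty multiset of triples $S$.
9: $\mathbf{X} \leftarrow \mathbf{0}_{m \times n}$.
10: For $t = 1 \ldots s$
11:    Generate a uniform random number $x \in [0, 1]$.
12:    If $x \le \alpha$, $S(t) \leftarrow S_1(t)$; otherwise, $S(t) \leftarrow S_2(t)$.
13:    $(i_t, j_t) \leftarrow S(t, 1{:}2)$.
14:    $p \leftarrow \alpha \cdot \frac{|S(t, 3)|}{\|\mathbf{A}\|_1} + (1 - \alpha) \cdot \frac{|S(t, 3)|^2}{\|\mathbf{A}\|_F^2}$.
15:    $\mathbf{X} \leftarrow \mathbf{X} + \frac{S(t, 3)}{p \cdot s}\, \mathbf{e}_{i_t} \mathbf{e}_{j_t}^T$.
16: End
17: Output: random multiset $S$, and sparse matrix $\mathbf{X}$.

Theorem 2 Using the notations in Algorithm 2, for $\alpha \in (0, 1]$, $t = 1, \ldots, s$,

$$P[S(t) = (i, j, \mathbf{A}_{ij})] = \alpha \cdot p_1 + (1 - \alpha) \cdot p_2,$$

where $p_1 = \frac{|\mathbf{A}_{ij}|}{\|\mathbf{A}\|_1}$ and $p_2 = \frac{\mathbf{A}_{ij}^2}{\|\mathbf{A}\|_F^2}$.

Proof: Here we use the notations in Theorem 2. Note that the $t$-th elements of $S_1$ and $S_2$ are sampled
independently with $\ell_1$ and $\ell_2$ probabilities, respectively. We consider the following disjoint events:

$\mathcal{E}_1$: $S_1(t) = (i, j, \mathbf{A}_{ij}) \wedge S_2(t) \ne (i, j, \mathbf{A}_{ij})$

$\mathcal{E}_2$: $S_1(t) \ne (i, j, \mathbf{A}_{ij}) \wedge S_2(t) = (i, j, \mathbf{A}_{ij})$

$\mathcal{E}_3$: $S_1(t) = (i, j, \mathbf{A}_{ij}) \wedge S_2(t) = (i, j, \mathbf{A}_{ij})$

$\mathcal{E}_4$: $S_1(t) \ne (i, j, \mathbf{A}_{ij}) \wedge S_2(t) \ne (i, j, \mathbf{A}_{ij})$

Let us denote the events $x_1 : x \le \alpha$ and $x_2 : x > \alpha$. Clearly, $P[x_1] = \alpha$ and $P[x_2] = 1 - \alpha$. Since the
elements $S_1(t)$ and $S_2(t)$ are sampled independently, we have

$$P[\mathcal{E}_1] = P[S_1(t) = (i, j, \mathbf{A}_{ij})]\, P[S_2(t) \ne (i, j, \mathbf{A}_{ij})] = p_1 (1 - p_2),$$

$$P[\mathcal{E}_2] = (1 - p_1) p_2, \qquad P[\mathcal{E}_3] = p_1 p_2, \qquad P[\mathcal{E}_4] = (1 - p_1)(1 - p_2).$$

We note that $\alpha$ may be dependent on the elements of $S_3$ and $S_4$ (in Algorithm 3), but is independent of the
elements of $S_1$ and $S_2$. Therefore, events $x_1$ and $x_2$ are independent of the events $\mathcal{E}_j$, $j = 1, 2, 3, 4$.
Thus,

$$\begin{aligned}
P[S(t) = (i, j, \mathbf{A}_{ij})] &= P[(\mathcal{E}_1 \wedge x_1) \vee (\mathcal{E}_2 \wedge x_2) \vee \mathcal{E}_3] \\
&= P[\mathcal{E}_1 \wedge x_1] + P[\mathcal{E}_2 \wedge x_2] + P[\mathcal{E}_3] \\
&= P[\mathcal{E}_1] P[x_1] + P[\mathcal{E}_2] P[x_2] + P[\mathcal{E}_3] \\
&= p_1 (1 - p_2) \alpha + (1 - p_1) p_2 (1 - \alpha) + p_1 p_2 \\
&= \alpha \cdot p_1 + (1 - \alpha) \cdot p_2.
\end{aligned}$$

$\diamond$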
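The selection stage of Algorithm 2, which runs after the stream is gone, can be sketched as below. This is a minimal illustration, assuming the multisets $S_1$ and $S_2$ and the two norms are already available; the sparse matrix is stored as a dictionary, and the name `mix_samples` is hypothetical.

```python
import random

def mix_samples(S1, S2, alpha, l1_norm, fro_sq, rng):
    """Pick the t-th triple from S1 (l1) with probability alpha, else from S2 (l2); build sparse X."""
    s = len(S1)
    S, X = [], {}
    for t in range(s):
        x = rng.random()
        i, j, v = S1[t] if x <= alpha else S2[t]
        S.append((i, j, v))
        # Hybrid probability of the selected entry, as in (2).
        p = alpha * abs(v) / l1_norm + (1 - alpha) * v * v / fro_sq
        X[(i, j)] = X.get((i, j), 0.0) + v / (p * s)
    return S, X

# Hypothetical pre-sampled multisets for the matrix diag(3, 1):
# ||A||_1 = 4 and ||A||_F^2 = 10.
S1 = [(0, 0, 3.0), (0, 0, 3.0)]
S2 = [(1, 1, 1.0), (1, 1, 1.0)]
S, X = mix_samples(S1, S2, alpha=1.0, l1_norm=4.0, fro_sq=10.0, rng=random.Random(0))
```

With $\alpha = 1$ every draw comes from $S_1$, so the example is deterministic. Since $\alpha$ enters only at this stage, it can be chosen after the pass over the data, provided it does not depend on the contents of $S_1$ and $S_2$.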

Note that Theorem 2 holds for any arbitrary $\alpha \in (0, 1]$ in line 7 of Algorithm 2, i.e., Algorithm 3
is not essential for correctness of Theorem 2. We only need $\alpha$ to be independent of the elements of $S_1$ and
$S_2$. However, we use Algorithm 3 to get an iterative estimate of $\alpha^*$ (Section 3.1) in one pass over $\mathbf{A}$. In
this case, we need additional independent multisets $S_3$ and $S_4$ to 'learn' the parameter $\alpha^*$. Algorithm 2
(without Algorithm 3) requires memory twice as large as that required by $\ell_1$ or $\ell_2$ sampling. Using Algorithm
3, this requirement is four times as large. However, in both cases the asymptotic memory requirement
remains the same, $O(s)$.

Algorithm 3 Iterative estimate of $\alpha^*$

1: Input: Multisets of triples $S_3$ and $S_4$ with $s$ elements each, number of iterations $\tau$, accuracy $\epsilon$,
   $\|\mathbf{A}\|_F^2$, and $\|\mathbf{A}\|_1$.
2: Create empty multiset of triples $S$.
3: $\alpha_0 = 0.5$
4: For $k = 1 \ldots \tau$
5:    $\mathbf{X} \leftarrow \mathbf{0}_{m \times n}$.
6:    For $t = 1 \ldots s$
7:       Generate a uniform random number $x \in [0, 1]$.
8:       If $x \le \alpha_{k-1}$, $S(t) \leftarrow S_3(t)$; else, $S(t) \leftarrow S_4(t)$.
9:       $(i_t, j_t) \leftarrow S(t, 1{:}2)$.
10:      $p \leftarrow \alpha_{k-1} \cdot \frac{|S(t, 3)|}{\|\mathbf{A}\|_1} + (1 - \alpha_{k-1}) \cdot \frac{|S(t, 3)|^2}{\|\mathbf{A}\|_F^2}$.
11:      $\mathbf{X} \leftarrow \mathbf{X} + \frac{S(t, 3)}{p \cdot s}\, \mathbf{e}_{i_t} \mathbf{e}_{j_t}^T$.
12:   End
13:   $\alpha_k \leftarrow \tilde{\alpha}$ in (9) using $\mathbf{X}$.
14: End
15: Output: $\alpha_\tau$.


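3.1 Iterative Estimate of $\alpha^*$


We obtain independent random multisets of triples $S_3$ and $S_4$, each containing $s$ elements from $\mathbf{A}$ in one
pass, in Algorithm 2. We can create a sparse random matrix $\mathbf{X}$, as shown in step 11 in Algorithm 3, that
is an unbiased estimator of $\mathbf{A}$. We use this $\mathbf{X}$ as a proxy for $\mathbf{A}$ to estimate the quantities we need in order
to solve the optimization problem in (9):

$$\begin{aligned}
\tilde{\alpha} &= \arg\min_{\alpha \in (0, 1]} \left\{ \rho^2(\alpha) + \gamma(\alpha)\, \epsilon\, \|\mathbf{X}\|_2 / 3 \right\}, \\
\text{where, for all } (i, j) \in S(:, 1{:}2), \quad \xi_{ij} &= \|\mathbf{X}\|_F^2 \Big/ \left( \frac{\alpha \cdot \|\mathbf{X}\|_F^2}{|\mathbf{X}_{ij}| \cdot \|\mathbf{X}\|_1} + (1 - \alpha) \right), \\
\rho^2(\alpha) &= \max\left\{ \max_i \sum_{j=1}^{n} \xi_{ij},\; \max_j \sum_{i=1}^{m} \xi_{ij} \right\}, \\
\gamma(\alpha) &= \max_{i,j} \left\{ \frac{\|\mathbf{X}\|_1}{\alpha + (1 - \alpha) \frac{\|\mathbf{X}\|_1 \cdot |\mathbf{X}_{ij}|}{\|\mathbf{X}\|_F^2}} \right\} + \|\mathbf{X}\|_2.
\end{aligned} \tag{9}$$

We note that $\|\mathbf{X}\|_0 \le s$. We can compute the quantities $\rho(\alpha)$ and $\gamma(\alpha)$, for a fixed $\alpha$, using $O(s)$
memory. We consider $\epsilon = \epsilon \cdot \|\mathbf{X}\|_2$ to be the given accuracy.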


4 Fast Approximation of PCA 

Here, we discuss a provable algorithm (Algorithm 4) to speed up computation of PCA by applying
element-wise sampling. We sparsify a given centered data matrix $\mathbf{A}$ to produce a sparse unbiased estimator
$\tilde{\mathbf{A}}$ by sampling $s$ elements in i.i.d. trials according to our hybrid-$(\ell_1, \ell_2)$ distribution in (2).
Computation of rank-truncated SVD on sparse data is fast, and we consider the right singular vectors of
$\tilde{\mathbf{A}}$ as the approximate principal components of $\mathbf{A}$. Naturally, more samples produce a better
approximation. However, this reduces sparsity, and consequently we lose the speed advantage. Theorem 3
shows the quality of approximation of the principal components produced by Algorithm 4.

Algorithm 4 Fast Approximation of PCA

1: Input: Centered data $\mathbf{A} \in \mathbb{R}^{m \times n}$, sparsity parameter $s > 0$, and rank parameter $k$.

2: Produce sparse unbiased estimator $\tilde{\mathbf{A}}$ from $\mathbf{A}$, in $s$ i.i.d. trials using Algorithm 1.

3: Perform rank-truncated SVD on sparse matrix $\tilde{\mathbf{A}}$, i.e., $[\tilde{\mathbf{U}}_k, \tilde{\mathbf{D}}_k, \tilde{\mathbf{V}}_k] = \mathrm{SVD}(\tilde{\mathbf{A}}, k)$.
4: Output: $\tilde{\mathbf{V}}_k$ (columns of $\tilde{\mathbf{V}}_k$ are the ordered PCA's).
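To illustrate why sparsity helps in step 3 of Algorithm 4, the sketch below runs plain power iteration for the leading right singular vector, touching only the stored non-zeros of the sketch. It is a simplified stand-in under stated assumptions: it handles only $k = 1$, uses a fixed iteration count and a uniform starting vector (which would fail if that vector were orthogonal to the leading direction), and is not the SVD routine used in the paper. The name `top_right_singular_vector` is hypothetical.

```python
import math

def top_right_singular_vector(sketch, n, iters=200):
    """Power iteration on X^T X for a sparse X given as {(i, j): value}; returns a unit vector."""
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        # y = X v, using only the stored non-zeros
        y = {}
        for (i, j), x in sketch.items():
            y[i] = y.get(i, 0.0) + x * v[j]
        # w = X^T y
        w = [0.0] * n
        for (i, j), x in sketch.items():
            w[j] += x * y[i]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # X v vanished; keep the current vector
        v = [c / norm for c in w]
    return v

# Hypothetical sparse sketch with two stored entries (the matrix diag(3, 1)).
v1 = top_right_singular_vector({(0, 0): 3.0, (1, 1): 1.0}, n=2)
```

Each iteration costs time proportional to the number of stored entries, which is at most $s$ for a sketch produced by Algorithm 1. A practical implementation would more likely call a sparse truncated SVD routine from a numerical library to obtain all $k$ vectors.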

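Theorem 3 Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ be a given matrix, and $\tilde{\mathbf{A}}$ be a sparse sketch produced by Algorithm 1. Let
$\tilde{\mathbf{V}}_k$ be the PCA's of $\tilde{\mathbf{A}}$ computed in step 3 of Algorithm 4. Then

$$\|\mathbf{A} - \mathbf{A} \tilde{\mathbf{V}}_k \tilde{\mathbf{V}}_k^T\|_F^2 \le \|\mathbf{A} - \mathbf{A}_k\|_F^2 + \frac{4 \|\mathbf{A}_k\|_F^2}{\sigma_k(\mathbf{A})} \|\mathbf{A} - \tilde{\mathbf{A}}\|_2,$$

$$\|\mathbf{A}_k - \tilde{\mathbf{A}}_k\|_F \le \sqrt{8k} \cdot \left( \|\mathbf{A} - \mathbf{A}_k\|_2 + \|\mathbf{A} - \tilde{\mathbf{A}}\|_2 \right),$$

$$\|\mathbf{A} - \tilde{\mathbf{A}}_k\|_F \le \|\mathbf{A} - \mathbf{A}_k\|_F + \sqrt{8k} \cdot \left( \|\mathbf{A} - \mathbf{A}_k\|_2 + \|\mathbf{A} - \tilde{\mathbf{A}}\|_2 \right).$$

The first inequality of Theorem 3 bounds the approximation of projected data onto the space spanned by the
top $k$ approximate PCA's. The second and third inequalities measure the quality of $\tilde{\mathbf{A}}_k$ as a surrogate for
$\mathbf{A}_k$ and the quality of projection of sparsified data onto approximate PCA's, respectively.

Proofs of the first two inequalities of Theorem 3 follow from Theorem 5 and Theorem 8 of Achlioptas and
McSherry (2001), respectively. The last inequality follows from the triangle inequality. The last two
inequalities above are particularly useful in cases where $\mathbf{A}$ is inherently low-rank and we choose an
appropriate $k$ for approximation, for which $\|\mathbf{A} - \mathbf{A}_k\|_2$ is small.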


5 Experiments 

In this section we perform various element-wise sampling experiments on synthetic and real data to show 
how well the sparse sketches preserve the structure of the original data, in spectral norm. Also, we show 
results on the quality of the PCA’s derived from sparse sketches. 


5.1 Algorithms for Sparse Sketches 


We use Algorithm 1 as a prototypical algorithm to produce sparse sketches from a given matrix via various
sampling methods. Note that we can plug in any element-wise probability distribution in Algorithm 1 to
produce (unbiased) sparse matrices. We construct sparse sketches via our optimal hybrid-$(\ell_1, \ell_2)$ sampling,
along with other sampling methods related to extreme choices of $\alpha$, such as $\ell_1$ sampling for $\alpha = 1$. Also,
we use element-wise leverage scores (Chen et al. (2014)) for sparsification of low-rank data. Element-wise
leverage scores are used in the context of low-rank matrix completion by Chen et al. (2014). Let $\mathbf{A}$
be an $m \times n$ matrix of rank $\rho$, and let its SVD be given by $\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$. Then, we define $\mu_i$ (row
leverage scores), $\nu_j$ (column leverage scores), and element-wise leverage scores $p_{lev}$ as follows:

$$\mu_i = \|\mathbf{U}_{(i)}\|_2^2, \qquad \nu_j = \|\mathbf{V}_{(j)}\|_2^2, \qquad p_{lev} = \frac{\mu_i + \nu_j}{(m + n) \rho}, \qquad i \in [m],\; j \in [n].$$

Note that $p_{lev}$ is a probability distribution on the elements of $\mathbf{A}$. Leverage scores become uniform if the
matrix $\mathbf{A}$ is full rank. We use $p_{lev}$ in Algorithm 1 to produce a sparse sketch $\tilde{\mathbf{A}}$ of low-rank data $\mathbf{A}$.

5.1.1 Experimental Design for Sparse Sketches


We compute the theoretical optimal mixing parameter $\alpha^*$ by solving eqn. (6) for various datasets. We
compare this $\alpha^*$ with the theoretical condition derived by Achlioptas et al. (2013) (for cases when $\ell_1$
sampling outperforms $\ell_2$ sampling). We verify the accuracy of $\alpha^*$ by measuring the quality of the sparse
sketches $\tilde{\mathbf{A}}$, $\mathcal{E} = \|\mathbf{A} - \tilde{\mathbf{A}}\|_2 / \|\mathbf{A}\|_2$, for various sampling distributions. Let $\mathcal{E}_h$, $\mathcal{E}_1$, and
$\mathcal{E}_{lev}$ denote the quality of sparse sketches produced via optimal hybrid sampling, $\ell_1$ sampling, and
element-wise leverage scores $p_{lev}$, respectively. We compare $\mathcal{E}_h$, $\mathcal{E}_1$, and $\mathcal{E}_{lev}$ for various sample
sizes for real and synthetic datasets.
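The element-wise leverage scores of Section 5.1 can be computed directly from the singular vectors. The sketch below assumes the formula $p_{lev} = (\mu_i + \nu_j) / ((m + n) \rho)$ with $\mu_i$ and $\nu_j$ the squared row norms of $\mathbf{U}$ and $\mathbf{V}$, and takes those factors as given rather than computing an SVD. The rank-one factors and the name `leverage_probabilities` are hypothetical.

```python
def leverage_probabilities(U, V):
    """Element-wise leverage scores from the rows of U (m x rho) and V (n x rho)."""
    rho = len(U[0])
    m, n = len(U), len(V)
    mu = [sum(c * c for c in row) for row in U]   # row leverage scores
    nu = [sum(c * c for c in row) for row in V]   # column leverage scores
    return [[(mu[i] + nu[j]) / ((m + n) * rho) for j in range(n)] for i in range(m)]

# Rank-1 toy example: unit-norm left and right singular vectors.
U = [[0.6], [0.8]]
V = [[1.0], [0.0], [0.0]]
P_lev = leverage_probabilities(U, V)
```

Because the columns of $\mathbf{U}$ and $\mathbf{V}$ are orthonormal, the row leverage scores sum to $\rho$ and so do the column leverage scores, which is why the entries of `P_lev` add up to one.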


5.2 Algorithms for Fast PCA 

We compare three algorithms for computing PCA of the centered data. Let the actual PCA of the original
data be $\mathcal{A}$. We use Algorithm 4 to compute approximate PCA via our optimal hybrid-$(\ell_1, \ell_2)$ sampling.
Let us denote this approximate PCA by $\mathcal{H}$. Also, we compute PCA of a Gaussian random projection of
the original data to compare the quality of $\mathcal{H}$. Let $\mathbf{A}_G = \mathbf{G} \mathbf{A} \in \mathbb{R}^{r \times n}$, where $\mathbf{A} \in \mathbb{R}^{m \times n}$ is the
original data, and $\mathbf{G}$ is an $r \times m$ standard Gaussian matrix. Let the PCA of this random projection $\mathbf{A}_G$
be $\mathcal{G}$. Also, let $T_a$, $T_h$, and $T_G$ be the computation time (in milliseconds) for $\mathcal{A}$, $\mathcal{H}$, and $\mathcal{G}$,
respectively.

5.2.1 Experimental Design for Fast PCA

We compare the visual quality of $\mathcal{A}$, $\mathcal{H}$, and $\mathcal{G}$ for image datasets. Also, we compare the computation
time $T_a$, $T_h$, and $T_G$ for these datasets.


5.3 Description of Data 

In this section we describe the synthetic and real datasets we use in our experiments. 

5.3.1 Synthetic Data 

We construct a binary $500 \times 500$ image data $\mathbf{D}$ (see Figure 1). We add random noise to perturb the
elements of the 'pure' data $\mathbf{D}$. Specifically, we construct a $500 \times 500$ noise matrix $\mathbf{N}$ whose elements
are drawn i.i.d. from a Gaussian with mean zero and standard deviation $\sigma$. We use two different values
for $\sigma$ in our experiments: $\sigma = 0.05$ and $\sigma = 0.10$. For each $\sigma$, we note the following ratios:

Noise-to-signal energy ratio $= \|\mathbf{N}\|_F / \|\mathbf{D}\|_F$,

Spectral ratio $= \|\mathbf{N}\|_2 / \sigma_k(\mathbf{D})$,

where $\sigma_k(\mathbf{D})$ is the $k$-th largest singular value of $\mathbf{D}$. For $\sigma = 0.05$ and $\sigma = 0.10$, the average
noise-to-signal energy ratios are 0.44 and 0.88, the average spectral ratios are 0.09 and 0.17, and the
average maximum absolute values of noise turn out to be 0.25 and 0.50, respectively. We denote noisy data by
$\mathbf{A}_{0.05}$ (respectively $\mathbf{A}_{0.1}$) when $\mathbf{D}$ is perturbed by $\mathbf{N}$ whose elements are drawn i.i.d. from a
Gaussian distribution with mean zero and $\sigma = 0.05$ (respectively $\sigma = 0.1$).


5.3.2 TechTC Datasets 


These datasets (Gabrilovich and Markovitch (2004)) are bag-of-words features for document-term data
describing two topics (ids). We choose four such datasets: TechTC1 with ids 10567 and 11346, TechTC2
with ids 10567 and 12121, TechTC3 with ids 11498 and 14517, TechTC4 with ids 11346 and 22294. Rows
represent documents and columns are the words. We preprocessed the data by removing all the words of
length four or smaller, and then normalized the rows by dividing each row by its Frobenius norm. The
following table lists the dimensions of the TechTC datasets.

Dimension (m x n)    m      n
TechTC1              139    15170
TechTC2              138    11859
TechTC3              125    15485
TechTC4              125    14392

Table 1: Dimension of TechTC datasets


5.3.3 Handwritten Digit Data 


A dataset (Hull (1994)) of three handwritten digits: six (664 samples), nine (644 samples), and one (1005
samples). Pixels are treated as features, and pixel values are normalized to [-1, 1]. Each $16 \times 16$ digit
image is first represented by a column vector by appending the pixels column-wise. Then, we use the
transpose of this column vector to form a row in the data matrix. The number of rows is $m = 2313$, and the
number of columns is $n = 256$.
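The column-wise vectorization of each image can be sketched as below. The random images are hypothetical stand-ins for the digit data, and the helper names are assumptions for illustration.

```python
import numpy as np

def image_to_row(img):
    """Stack the columns of a 2-D image into one row vector (column-major order)."""
    return np.asarray(img).flatten(order="F")

def build_data_matrix(images):
    """One row per image; 16 x 16 inputs give n = 256 columns."""
    return np.vstack([image_to_row(im) for im in images])

# Hypothetical stand-in images with pixel values in [-1, 1].
rng = np.random.default_rng(0)
images = rng.uniform(-1.0, 1.0, size=(4, 16, 16))
A = build_data_matrix(images)
print(A.shape)  # (4, 256)
```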


5.3.4 Stock Data 

We use a stock market dataset (S&P) containing prices of 1218 stocks collected between 1983 and 2011. 
This temporal dataset has 7056 snapshots of stock prices. Thus, we have m = 1218 and n = 7056. 


We provide summary statistics for all the datasets in Table 2. In order to compare our results with
Achlioptas et al. (2013), we review the matrix metrics that they use. Let the numeric density of matrix $X$
be $nd(X) = \|X\|_1^2 / \|X\|_F^2$. Clearly, $nd(X) \le \|X\|_0$, with equality holding for zero-one matrices. The
row density skew of $X$ is defined as

$rs_0(X) = \dfrac{\max_i \|X_{(i)}\|_0}{\|X\|_0 / m}$,

i.e., the ratio between the number of non-zeros in the densest row and the average number of non-zeros per
row. The numeric row density skew,

$rs_1(X) = \dfrac{\max_i \|X_{(i)}\|_1}{\|X\|_1 / m}$,

is a smooth analog of $rs_0(X)$. Achlioptas et al. (2013) assumed that $m \le n$ without loss of generality,
and, for simplicity, that $\max_i \|X_{(i)}\|_\xi \ge \max_j \|X^{(j)}\|_\xi$ for all $\xi \in \{0, 1, 2\}$. We notice that, although the Digit
dataset does not satisfy the above conditions, its transpose does. We can work on the transposed dataset
without loss of generality, and hence we take note of $rs_0$ and $rs_1$ of the transposed Digit data.
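The three metrics are straightforward to compute. The sketch below assumes entrywise norms ($\|X\|_0$ counts non-zeros, $\|X\|_1$ sums absolute values); the function names are chosen for illustration.

```python
import numpy as np

def numeric_density(X):
    """nd(X) = ||X||_1^2 / ||X||_F^2 with entrywise norms."""
    return np.abs(X).sum() ** 2 / (X ** 2).sum()

def row_density_skew(X):
    """rs_0(X): non-zeros in the densest row over the average per row."""
    nnz_rows = np.count_nonzero(X, axis=1)
    return nnz_rows.max() / (np.count_nonzero(X) / X.shape[0])

def numeric_row_density_skew(X):
    """rs_1(X): largest row l1 norm over the average row l1 norm."""
    l1_rows = np.abs(X).sum(axis=1)
    return l1_rows.max() / (l1_rows.sum() / X.shape[0])

# A zero-one matrix, for which nd(X) equals the number of non-zeros.
X = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0]])
print(numeric_density(X), row_density_skew(X), numeric_row_density_skew(X))
```

For a dataset with $m > n$, such as the Digit data, the same functions can be applied to `X.T`.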


5.4 Results 

We report all the results based on an average of five independent trials. We observe a small variance of the
results.


5.4.1 Quality of Sparse Sketch 

Dataset        $\|X\|_0$    $nd$       $rs_0$    $rs_1$
$A_{0.05}$     2.5e+5       4.4e+4     1         2.66
$A_{0.1}$      2.5e+5       9.2e+4     1         1.95
TechTC1        37831        12204      5.14      2.18
TechTC2        29334        9299       3.60      2.10
TechTC3        47304        14201      7.23      2.31
TechTC4        35018        10252      4.99      2.25
Digit          5.9e+5       5.1e+5     1         1.3
Stock          5.5e+6       6.5e+3     1.56      1.1e+03

Table 2: Summary statistics for the data sets


We first note that the three sampling methods, $\ell_1$, $\ell_2$, and hybrid-($\ell_1$, $\ell_2$), perform identically on noiseless data
$D$. We report the total probability of sampling noisy elements in $A = D + N$ (elements which are zeros
in $D$). $\ell_1$ sampling shows the highest susceptibility to noise, whereas small-valued noisy elements are
suppressed in $\ell_2$. Hybrid-($\ell_1$, $\ell_2$) sampling, with $\alpha < 1$, samples mostly from true data elements, and thus
captures the low-rank structure of the data better than $\ell_1$. The optimal mixing parameter $\alpha^*$ maintains the
right balance between $\ell_2$ sampling and $\ell_1$ regularization and gives the smallest sample size to achieve a
desired accuracy. Table 3 summarizes $\alpha^*$ for various data sets. Achlioptas et al. (2013) argued that, as
long as $rs_0(X) \ge rs_1(X)$, $\ell_1$ sampling is better than $\ell_2$ (even with truncation). Our results on $\alpha^*$ in Table
3 confirm this condition. Moreover, our method can derive the right blend of $\ell_1$ and $\ell_2$ sampling even
when the above condition fails. In this sense, we generalize the results of Achlioptas et al. (2013).
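The susceptibility to noise can be illustrated numerically. The sketch below assumes the hybrid distribution is the convex combination $p_{ij} = \alpha |A_{ij}| / \|A\|_1 + (1 - \alpha) A_{ij}^2 / \|A\|_F^2$, which is consistent with $\alpha^* = 1$ denoting pure $\ell_1$ sampling in Table 3. The block-structured matrix $D$ is a hypothetical example, and the function names are assumptions.

```python
import numpy as np

def hybrid_probs(A, alpha):
    """p_ij = alpha*|A_ij|/||A||_1 + (1-alpha)*A_ij^2/||A||_F^2 (assumed form)."""
    absA = np.abs(A)
    return alpha * absA / absA.sum() + (1.0 - alpha) * absA ** 2 / (absA ** 2).sum()

def noise_mass(A, D, alpha):
    """Total probability of sampling entries that are zero in the pure data D."""
    return hybrid_probs(A, alpha)[D == 0].sum()

rng = np.random.default_rng(0)
# Hypothetical pure data: a block of ones inside a zero matrix.
D = np.zeros((100, 100))
D[:20, :20] = 1.0
A = D + rng.normal(scale=0.05, size=D.shape)
for alpha in (1.0, 0.5, 0.0):   # pure l1, a hybrid, pure l2
    print(alpha, noise_mass(A, D, alpha))
```

Under this assumed form, the probability mass on noisy elements is linear in $\alpha$, so it interpolates between the $\ell_1$ and $\ell_2$ values.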



Dataset        $\epsilon = 0.05$    $\epsilon = 0.75$    $rs_0 \ge rs_1$
$A_{0.05}$     0.62                 0.69                 no
$A_{0.1}$      0.63                 0.70                 no
TechTC1        1                    1                    yes
TechTC2        1                    1                    yes
TechTC3        1                    1                    yes
TechTC4        1                    1                    yes
Digit          0.20                 0.74                 no
Stock          0.74                 0.75                 no

Table 3: $\alpha^*$ for various data sets ($\epsilon$ is the desired relative-error accuracy). The last column compares $\alpha^*$
with the condition established by Achlioptas et al. (2013). Whenever $rs_0 \ge rs_1$, Achlioptas et al. (2013)
show that $\ell_1$ sampling is always better than $\ell_2$ sampling, and we find $\alpha^* = 1$ ($\ell_1$ sampling). However,
when $rs_0 < rs_1$, $\alpha^* < 1$ and our hybrid sampling is strictly better.
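To make the role of the mixing parameter concrete, the following sketch builds a sparse sketch by sampling $s$ entries and compares the spectral-norm error across several values of $\alpha$. It is a brute-force empirical stand-in, not the procedure used to obtain $\alpha^*$ in Table 3. The sampling scheme (i.i.d. trials with replacement, rescaling by $1/(s\,p_{ij})$), the convex-combination form of the distribution, the test data, and all function names are assumptions.

```python
import numpy as np

def hybrid_probs(A, alpha):
    """Assumed hybrid distribution: convex combination of l1 and l2 probabilities."""
    absA = np.abs(A)
    return alpha * absA / absA.sum() + (1.0 - alpha) * absA ** 2 / (absA ** 2).sum()

def sparse_sketch(A, alpha, s, rng):
    """Sample s entries i.i.d. with replacement and rescale so that E[sketch] = A."""
    p = hybrid_probs(A, alpha).ravel()
    idx = rng.choice(A.size, size=s, replace=True, p=p)
    sketch = np.zeros(A.size)
    # np.add.at accumulates repeated indices, unlike fancy-index assignment.
    np.add.at(sketch, idx, A.ravel()[idx] / (s * p[idx]))
    return sketch.reshape(A.shape)

def relative_error(A, sketch):
    """Spectral-norm error ||A - sketch||_2 / ||A||_2."""
    return np.linalg.norm(A - sketch, 2) / np.linalg.norm(A, 2)

def empirical_best_alpha(A, s, alphas, rng, trials=5):
    """Average the error over a few trials per alpha and return the minimizer."""
    errs = [np.mean([relative_error(A, sparse_sketch(A, a, s, rng))
                     for _ in range(trials)]) for a in alphas]
    return alphas[int(np.argmin(errs))], errs

rng = np.random.default_rng(0)
D = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 60))   # hypothetical low-rank data
A = D + rng.normal(scale=0.05, size=D.shape)
alphas = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
best, errs = empirical_best_alpha(A, s=5 * 3 * (60 + 60), alphas=alphas, rng=rng)
print(best, errs)
```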


Figure 4 plots $\mathcal{E} = \|A - \tilde{A}\|_2 / \|A\|_2$ for various values of $\alpha$ and sample size $s$ for various datasets. It
clearly shows that our optimal hybrid sampling is superior to $\ell_1$ or $\ell_2$ sampling.

[Figure 4: panels (a) $A_{0.05}$, (b) $A_{0.1}$, (c) Digit, (d) Stock; x-axis $\alpha$ from 0.1 to 1; one curve per sample size $s/(k(m+n)) = 1, 3, 5$]

Figure 4: Approximation quality of sparse sketch $\tilde{A}$: hybrid-($\ell_1$, $\ell_2$) sampling, for various $\alpha$ and different
sample sizes $s$. The x-axis is $\alpha$, and the y-axis plots $\|A - \tilde{A}\|_2 / \|A\|_2$ (in $\log_2$ scale such that larger
negative values indicate better quality). Each figure corresponds to a dataset: (a) $A_{0.05}$, (b) $A_{0.1}$, (c) Digit,
and (d) Stock. We set $k = 5$ for synthetic data, $k = 3$ for Digit data, and $k = 1$ for Stock data. The choice of
$k$ is close to the stable rank of the data.

We also compare the quality of sparse sketches produced via our hybrid sampling with that of $\ell_2$
sampling with truncation. We use two predetermined truncation parameters, $\epsilon = 0.1$ and $\epsilon = 0.01$, for $\ell_2$
sampling. First, $\ell_2$ sampling without truncation turns out to be the worst for all datasets. $\ell_2$ with $\epsilon = 0.01$
appears to produce a sparse sketch $\tilde{A}$ that is as bad as $\ell_2$ without truncation for $A_{0.1}$ and $A_{0.05}$. However,
$\ell_2$ with $\epsilon = 0.1$ shows better performance than hybrid sampling for $A_{0.1}$ and $A_{0.05}$, because this choice
of $\epsilon$ turns out to be an appropriate threshold to zero out most of the noisy elements. We must point out
that, in this example, we control the noise, and we know what a good threshold may look like. However,
in reality we have no control over the noise. Therefore, choosing the right threshold for $\ell_2$, without any
prior knowledge, is an improbable task. For real datasets, it turns out that hybrid-($\ell_1$, $\ell_2$) sampling
using $\alpha^*$ outperforms $\ell_2$ sampling with the predefined thresholds for various sample sizes.
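The effect of the truncation threshold can be illustrated with a small experiment. The sketch below assumes a simple truncation rule, zeroing every entry with $|A_{ij}| \le \epsilon$ before forming $\ell_2$ probabilities; the exact rule used in the experiments may be scaled differently. The matrix $D$ and the function name are hypothetical.

```python
import numpy as np

def truncated_l2_probs(A, threshold):
    """Zero out entries with |A_ij| <= threshold, then use l2 (squared-magnitude) probabilities."""
    T = np.where(np.abs(A) > threshold, A, 0.0)
    sq = T ** 2
    return sq / sq.sum()

rng = np.random.default_rng(0)
D = np.zeros((100, 100))
D[:20, :20] = 1.0                                  # hypothetical pure data
A = D + rng.normal(scale=0.05, size=D.shape)       # noise std 0.05, as for A_0.05
for eps in (0.0, 0.01, 0.1):
    p = truncated_l2_probs(A, eps)
    print(eps, p[D == 0].sum())                    # probability mass left on noise
```

In this toy setting a threshold well below the noise level removes only entries that carried little probability mass anyway, while a threshold around twice the noise standard deviation removes most of the noisy entries. Picking the latter requires knowing the noise level in advance.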

We compare the quality of Algorithm 3 producing an iterative estimate of $\alpha^*$ in a very restricted
setup, i.e., one pass over the elements of data using $O(s)$ memory. Table 4 lists $\tilde{\alpha}$, the estimated $\alpha^*$, for some
of the datasets, for two choices of $s$ using 10 iterations. We compare these values with the plots in Figure
4, where the results are generated without any restriction on the size of memory or the number of passes over the
elements of the datasets.



Dataset                $s/(k(m+n)) = 2$    $s/(k(m+n)) = 3$
$A_{0.05}$, $k = 5$    0.54                0.48
$A_{0.1}$, $k = 5$     0.55                0.5
Digit, $k = 3$         0.69                0.89
Stock, $k = 1$         1                   1

Table 4: Values of $\tilde{\alpha}$ (estimated $\alpha^*$ using Algorithm 3) for various data sets using one pass over the
elements of data and $O(s)$ memory. We use $\epsilon = 0.05$, $\delta = 0.1$.


Finally, we compare our hybrid-($\ell_1$, $\ell_2$) sampling with element-wise leverage score sampling (similar
to Chen et al. (2014)) to produce quality sparse sketches from low-rank matrices. For this, we construct
a $500 \times 500$ low-rank power-law matrix, similar to Chen et al. (2014), as follows: $A_{pow} = D X Y^T D$,
where matrices $X$ and $Y$ are $500 \times 5$ i.i.d. Gaussian $\mathcal{N}(0, 1)$ and $D$ is a diagonal matrix with power-law
decay, $D_{ii} = i^{-\gamma}$, $1 \le i \le 500$. The parameter $\gamma$ controls the 'incoherence' of the matrix, i.e., larger
values of $\gamma$ make the data more 'spiky'. Table 5 lists the quality of sparse sketches produced via the two
sampling methods.
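The construction of $A_{pow}$ can be sketched as follows. The function name, the seed, and the 'spikiness' summary (share of squared Frobenius mass in the largest 1% of entries) are assumptions added for illustration only.

```python
import numpy as np

def power_law_matrix(n=500, k=5, gamma=1.0, rng=None):
    """A_pow = D X Y^T D with D_ii = i^(-gamma) and Gaussian n x k factors X, Y."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, k))
    Y = rng.standard_normal((n, k))
    d = np.arange(1, n + 1, dtype=float) ** (-gamma)
    # Multiplying by a diagonal matrix on both sides scales rows and columns.
    return d[:, None] * (X @ Y.T) * d[None, :]

rng = np.random.default_rng(0)
for gamma in (0.5, 0.8, 1.0):
    A = power_law_matrix(gamma=gamma, rng=rng)
    top = np.sort((A ** 2).ravel())[::-1]
    # Rank, and the fraction of squared Frobenius mass held by the 1% largest entries.
    print(gamma, np.linalg.matrix_rank(A), top[: A.size // 100].sum() / top.sum())
```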



$\gamma$    $s/(k(m+n))$    hybrid-($\ell_1$, $\ell_2$)    $p_{lev}$
0.5         3               42%                            58%
0.5         5               31%                            43%
0.8         3               15%                            43%
0.8         5               12%                            40%
1.0         3               8%                             42%
1.0         5               6%                             39%

Table 5: Sparsification quality $\|A_{pow} - \tilde{A}_{pow}\|_2 / \|A_{pow}\|_2$ for low-rank 'power-law' matrix $A_{pow}$ ($k = 5$).
We compare the quality of hybrid-($\ell_1$, $\ell_2$) sampling and leverage score sampling for two sample
sizes. We note the (average) $\alpha^*$ of the hybrid-($\ell_1$, $\ell_2$) distribution for data $A_{pow}$ using $\epsilon = 0.05$, $\delta = 0.1$. For
$\gamma = 0.5, 0.8, 1.0$, we have $\alpha^* = 0.11, 0.72, 0.8$, respectively.


We note that, with increasing $\gamma$, leverage scores get more aligned with the structure of the data,
resulting in gradually improving approximation quality for the same sample size. Larger $\gamma$ produces more
variance in data elements. The $\ell_2$ component of our hybrid distribution biases us towards the larger data
elements, while $\ell_1$ works as a regularizer to maintain the variance of the sampled (and rescaled) elements.
With increasing $\gamma$ we need more regularization to counter the problem of rescaling. Interestingly, our
optimal parameter $\alpha^*$ adapts itself to this changing structure of data, e.g., for $\gamma = 0.5, 0.8, 1.0$, we
have $\alpha^* = 0.11, 0.72, 0.8$, respectively. This shows the benefit of our parameterized hybrid distribution to
achieve a superior approximation quality. Figure 5 shows the structure of the data $A_{pow}$ for $\gamma = 1.0$ along
with the optimal hybrid-($\ell_1$, $\ell_2$) distribution and the leverage score distribution $p_{lev}$. The figure suggests our
optimal hybrid distribution is better aligned with the structure of the data, requiring a smaller sample size to
achieve a desired sparsification accuracy.
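The regularizing role of the $\ell_1$ component can be checked numerically. Assuming the hybrid distribution is the convex combination $p_{ij} = \alpha |A_{ij}| / \|A\|_1 + (1 - \alpha) A_{ij}^2 / \|A\|_F^2$, every sampled entry is rescaled by $1/p_{ij}$, and $p_{ij} \ge \alpha |A_{ij}| / \|A\|_1$ gives $|A_{ij}| / p_{ij} \le \|A\|_1 / \alpha$ for $\alpha > 0$. Pure $\ell_2$ sampling has no such bound. The heavy-tailed test matrix and the function name below are hypothetical.

```python
import numpy as np

def max_rescaled_entry(A, alpha):
    """Largest |A_ij| / p_ij over non-zero entries under the assumed hybrid distribution."""
    absA = np.abs(A)
    p = alpha * absA / absA.sum() + (1.0 - alpha) * absA ** 2 / (absA ** 2).sum()
    nz = absA > 0
    return (absA[nz] / p[nz]).max()

rng = np.random.default_rng(0)
# Hypothetical spiky data: Gaussian entries scaled by heavy-tailed factors.
A = rng.standard_normal((200, 200)) * rng.pareto(1.5, size=(200, 200))
l1_norm = np.abs(A).sum()
for alpha in (0.0, 0.1, 0.5, 1.0):
    bound = l1_norm / alpha if alpha > 0 else np.inf
    print(alpha, max_rescaled_entry(A, alpha), bound)
```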

We also compare the performance of the two sampling methods, optimal hybrid and leverage scores,
on rank-truncated Digit data. It turns out that projection of Digit data onto the top three principal components
preserves the separation of digit categories. Therefore, we rank-truncate Digit data via SVD using rank
three. Table 6 shows the superior quality of sparse sketches produced via optimal hybrid-($\ell_1$, $\ell_2$) sampling
for this rank-truncated Digit data.
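Rank truncation via SVD keeps the top $k$ singular triplets. A minimal sketch, with a random matrix as a hypothetical stand-in for the Digit data and `rank_truncate` as an illustrative name:

```python
import numpy as np

def rank_truncate(A, k):
    """Best rank-k approximation of A via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(300, 256))   # hypothetical stand-in for the Digit matrix
A3 = rank_truncate(A, 3)
print(np.linalg.matrix_rank(A3))              # 3
```

The spectral-norm error of this truncation equals the fourth singular value of `A`.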



$s/(k(m+n))$    Hybrid-($\ell_1$, $\ell_2$)    $p_{lev}$
3               44%                            61%
5               34%                            47%

Table 6: Sparsification quality $\|A - \tilde{A}\|_2 / \|A\|_2$ for rank-truncated Digit matrix ($k = 3$). We compare the
optimal hybrid-($\ell_1$, $\ell_2$) sampling and leverage score sampling for two sample sizes.


Finally, Table 7 shows the superiority of optimal hybrid-($\ell_1$, $\ell_2$) sampling for the rank-truncated (rank 5)
$A_{0.1}$ matrix for matrix sparsification.

$s/(k(m+n))$    Hybrid-($\ell_1$, $\ell_2$)    $p_{lev}$
3               25%                            80%
5               21%                            62%

Table 7: Sparsification quality $\|A - \tilde{A}\|_2 / \|A\|_2$ for rank-truncated $A_{0.1}$ matrix ($k = 5$). We compare the
optimal hybrid-($\ell_1$, $\ell_2$) sampling and leverage score sampling using two sample sizes.

[Figure 5: three panels, (a) structure of $A_{pow}$, (b) distribution $p_{lev}$, (c) optimal hybrid distribution for $A_{pow}$]

Figure 5: Comparing the optimal hybrid-($\ell_1$, $\ell_2$) distribution with leverage scores for data $A_{pow}$ for
$\gamma = 1.0$. (a) Structure of $A_{pow}$, (b) distribution $p_{lev}$, (c) optimal hybrid-($\ell_1$, $\ell_2$) distribution. Our optimal
hybrid distribution is more aligned with the structure of the data, requiring a much smaller sample size to
achieve a given accuracy of sparsification. This is supported by Table 5.

5.4.2 Quality of Fast PCA

We investigate the quality of fast PCA approximation (Algorithm 4) for Digit data and $A_{0.1}$. We set
$r = 30 \cdot k$ for the random projection matrix $A_G$ to achieve a comparable runtime of $\mathcal{G}$ with $\mathcal{H}$. Figure 6
shows the PCA (exact and approximate) for Digit data. Also, we consider visualization of the projected
data onto the top three principal components (exact and approximate) in Figure 6. In Figure 6, we form an
average digit for each digit category by taking the average of pixel intensities in the projected data over
all the digit samples in each category. Similarly, Figure 7 shows the visual results for data $A_{0.1}$ (we set
$k = 5$). Finally, Table 8 lists the gain in computation time for Algorithm 4 due to sparsification.
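The comparison between exact and approximate principal components can be outlined in code. This is only a plausible outline and should not be read as Algorithm 4 itself: it takes the top-$k$ right singular vectors of a sparse sketch as approximate principal components and measures how well the actual data projects onto them. The form of the hybrid distribution, the choice $\alpha = 0.7$, the random-projection baseline $A_G = G A$ with a Gaussian $G$ of $r = 30k$ rows, the test data, and all names are assumptions.

```python
import numpy as np

def top_k_components(M, k):
    """Top-k right singular vectors of M, returned as an n x k matrix."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:k].T

def projection_error(A, V):
    """Relative Frobenius error of projecting the rows of A onto span(V)."""
    return np.linalg.norm(A - A @ V @ V.T, "fro") / np.linalg.norm(A, "fro")

def sparse_sketch(A, alpha, s, rng):
    """Hybrid element-wise sampling with rescaling (assumed form of the distribution)."""
    absA = np.abs(A)
    p = (alpha * absA / absA.sum() + (1 - alpha) * absA ** 2 / (absA ** 2).sum()).ravel()
    idx = rng.choice(A.size, size=s, replace=True, p=p)
    out = np.zeros(A.size)
    np.add.at(out, idx, A.ravel()[idx] / (s * p[idx]))  # accumulate repeated samples
    return out.reshape(A.shape)

rng = np.random.default_rng(0)
k = 5
A = rng.normal(size=(300, k)) @ rng.normal(size=(k, 200))   # hypothetical low-rank data
A += rng.normal(scale=0.1, size=A.shape)
sketch = sparse_sketch(A, alpha=0.7, s=10 * k * sum(A.shape), rng=rng)
V_exact = top_k_components(A, k)
V_sparse = top_k_components(sketch, k)
# Assumed random-projection baseline: compress to r = 30 * k rows with a Gaussian matrix.
A_G = rng.normal(size=(30 * k, A.shape[0])) @ A
V_proj = top_k_components(A_G, k)
for name, V in (("exact", V_exact), ("sparse", V_sparse), ("projection", V_proj)):
    print(name, projection_error(A, V))
```

The exact components give the smallest projection error by construction, so the other two numbers indicate how much is lost by each approximation on this toy input.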



                         Sparsified Digit    Sparsified $A_{0.1}$
Sparsity                 93%                 94%
$T_H / T_A / T_G$        30/151/36           18/73/36

Table 8: Computational gain of Algorithm 4 compared to exact PCA. We report the computation time
of the MATLAB function 'svds(A,k)' for actual data ($T_A$), sparsified data ($T_H$), and random projection data
$A_G$ ($T_G$). We use only 7% and 6% of all the elements of Digit data and $A_{0.1}$, respectively, to construct
the respective sparse sketches.

[Figure 6: two panels, (a) PCAs, (b) projected data onto the PCAs]

Figure 6: Approximation quality of fast PCA (Algorithm 4) on Digit data. (a) Visualization of principal
components as $16 \times 16$ images. Principal components are ordered from the top row to the bottom. The first
column of PCAs is the exact $\mathcal{A}$. The second column of PCAs is $\mathcal{H}$, computed on sparsified data using ~7% of
all the elements via optimal hybrid sampling. The third column of PCAs is $\mathcal{G}$, computed on $A_G$. Visually,
$\mathcal{H}$ is closer to $\mathcal{A}$. (b) Visualization of projected data onto the top three PCAs. The first column shows the average
digits of actual data projected onto the exact PCAs $\mathcal{A}$. The second column shows the average digits of
actual data projected onto the approximate PCAs (of sampled data) $\mathcal{H}$. We observe a similar quality of average digits
of actual data projected onto the approximate PCAs $\mathcal{G}$ of $A_G$. The third column shows the average digits for
sparsified data projected onto the approximate PCAs $\mathcal{H}$.

Figure 7: Approximation quality of fast PCA (Algorithm 4) for data $A_{0.1}$. Visualization of projected data
onto the top five PCAs. The left image shows the actual data projected onto the exact PCAs $\mathcal{A}$. The middle image is
the projection of actual data onto the approximate PCAs (of sampled data) $\mathcal{H}$. We observe a similar quality
of PCAs $\mathcal{G}$ for $A_G$. The right image shows the sparsified data projected onto the approximate PCAs $\mathcal{H}$. We use
only 6% of all the elements to produce the sparse sketches via optimal hybrid sampling.

5.5 Conclusion 

Overall, the experimental results demonstrate the quality of the algorithms presented here, indicating the
superiority of our approach over the extreme choices of element-wise sampling methods, namely pure $\ell_1$ and
pure $\ell_2$ sampling. Also, we demonstrate the theoretical and practical usefulness of hybrid-($\ell_1$, $\ell_2$) sampling
for fundamental data analysis tasks such as fast computation of PCA. Finally, our method outperforms
element-wise leverage scores for the sparsification of various low-rank synthetic and real data matrices.